(54) An iterative technique for phrase query formation and an information retrieval system employing
same.

(57) An information retrieval system and method
are provided in which an operator inputs (110) 
one or more query words which are used to 
determine a search key (120) for searching (130) 
through a corpus of documents, and which 
returns (140) any matches between the search 
key and the corpus of documents as a phrase 
containing the word data matching the search 
key (the query word(s)), a non-stop (content) 
word next adjacent to the matching word data, 
and all intervening stop-words between the 
matching word data and the next adjacent 
non-stop word. The operator, after reviewing
one or more of the returned phrases, can then
use one or more of the next adjacent non- 
stop-words as new query words to reformulate 
the search key (150, 160, 170) and perform a 
subsequent search through the document cor- 
pus. This process can be conducted iteratively, 
until the appropriate documents of interest are 
located. The additional non-stop-words from 
each phrase are preferably aligned with each 
other (e.g., by columnation) to ease viewing of 
the "new" content words.
The present invention relates to information
retrieval systems. More particularly, this invention
relates to a method and apparatus for assisting an
operator in forming a phrase query for searching through
a library of documents.

Due to the ever increasing affordability and
accessibility of very large, online text collections,
Information Access, the science of processing natural
language texts for the purposes of search and retrieval,
has been the focus of heightened attention in the last
few years, although researchers have been active in
the field since the early sixties. Numerous approaches
have been attempted, but they all suffer from the
obvious difficulty that information access is
quintessentially a cognitive task. The degree of automatic
language understanding required for a complete
solution is clearly outside the bounds of current
technology. Instead, heuristic search techniques attempt to
match an admittedly incomplete query description 
with an admittedly incomplete set of features extract- 
ed from the texts of interest. The interest therefore 
lies in the development of procedures that more ef- 
fectively bridge the gap between an individual's par- 
tially stated desires and a universe of text, which ap- 
pears, computationally, as a sequence of uninterpret- 
ed words. 

Many of these procedures are statistical in na- 
ture. They take advantage of repeated occurrences of 
the same word to infer relations between documents, 
and between queries and documents. (A "document" 
need not correspond to any particular organization. It 
might be a chapter in a book, a section within a chap- 
ter, or an individual paragraph. However, as defined 
herein, a set of documents forming a corpus is an ex- 
haustive and disjoint partition of that corpus.) For ex- 
ample, similarity search induces a "relevance" order- 
ing on the text collection by scoring each document 
with a normalized sum of importance weights as- 
signed to each word in common between it and the 
query, where the importance weights depend upon 
document and collection, or corpus, frequencies. A 
more formal approach scores documents with their 
estimated probability of relevance to the query by 
adopting a text model which assumes word occur- 
rences are sequentially uncorrelated and training on 
a set of known relevant documents. In contrast,
polysemy (one word having multiple senses) and word
correlation are directly addressed by Latent Semantic
Indexing, which attempts to extract characteristic linear
combinations through a singular value decomposition
of a word co-occurrence matrix. The availability of
interdocument similarity measures suggests clustering,
which has been pursued both as an accelerator for 
conventional search and as a query broadening tool. 
Finally, linear discriminant analysis has been de- 
ployed to classify documents based on a training set 
which matches features, including word overlap and 
word positioning, with relevance to previous queries. 



Another suite of techniques attempts to enrich the
basic feature set by annotating words with their
lexical and syntactic functions. For example, fast lookup
algorithms from computational linguistics reduce
words to their stems. See L. Karttunen et al, "A
compiler for two-level phonological rules", Report
CSLI-87-108, Center for the Study of Language and
Information, 1987. Hidden Markov Modeling has been
successfully employed to reintroduce part-of-speech
tags, given a lexicon, with greater than 95% accuracy.
See J. Kupiec, "Augmenting a hidden Markov model
for phrase-dependent word tagging", Proceedings of
the 1989 DARPA Speech and Natural Language
Workshop, Cape Cod, MA, October 1989. An extension
of this technique, known as the inside-outside
algorithm, promises a method for inducing a stochastic
grammar given sufficient training text. See J.K. Baker,
"Trainable grammars for speech recognition",
Speech Communication Papers for the 97th Meeting
of the Acoustical Society of America, pages 547-550,
1979; and T. Fujisaki et al, "A probabilistic method for
sentence disambiguation", Proceedings of the
International Workshop on Parsing Technologies, August
1989. Less ambitious procedures aim at robustly
extracting noun phrases given a sequence of
part-of-speech tags. Word co-occurrence relations have
been exploited to order alternatives in cases of lexical
ambiguity. Non-parametric classification procedures
have been used to detect sentence boundaries in the
face of typographic ambiguity. See M. Riley, "Some
applications of tree-based modelling to speech and
language", Proceedings of the DARPA Speech and
Natural Language Workshop, Cape Cod, MA, pages
339-352, October 1989.

A typical information access scenario involves a
corpus of natural language text documents, and a
user with an information need. The task is to satisfy
that need, usually by delivering one or more relevant
documents from the corpus of interest. This is
accomplished by extracting from each document a feature
set, and providing the user with a tool which allows
search over these features in some prescribed
fashion. For example, a standard boolean search
technique assumes the feature set is one or more words
extracted from the text of the document, and the query
language is boolean expressions involving those
words. See IBM Germany, Stuttgart, "Storage and
Information Retrieval Systems (STAIRS)", April 1972.
Since it is anticipated that the corpus may be very
large, construction of a feature index by preprocessing
each document is a standard search accelerator.
See G. Salton, "Automatic Text Processing",
Addison-Wesley, 1989.

Conventional search techniques are cast within a
framework that might be referred to as the library
automation paradigm. It is presumed that the cost for
evaluating a query is sufficiently high that a single
iteration must return as high quality, and as complete,
a response as possible. This is in keeping with online
systems that charge for connect time, and is reflected
in evaluation criteria that discount the cost of query
formulation and measure the precision and recall
levels for the ranked set of documents which is implicitly
presumed to be the result. Ironically, the best
improvements to date, with respect to these criteria,
come from an incremental query reformulation
technique, known as relevance feedback. See G. Salton
et al, "Improving Retrieval Performance by
Relevance Feedback", Journal of the American Society for
Information Science, 41 (4): 288-297, June 1990.

Boolean keyword search is a well-known search 
technique in information retrieval. Essentially, a set of 
terms, typically individual words, or word stems, is ex- 
tracted from the unrestricted text of each document in 
a larger corpus. Search then proceeds by forming a 
boolean expression in terms of these keywords which 
is resolved by finding the set of documents that sat- 
isfy that expression. For example, a typical query 
might consist of the conjunction of two search terms. 
Documents that contain both terms in any order and 
any position would then be returned. Disjunction and, 
less frequently, negation are also likely to be support- 
ed. 
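
The boolean keyword search described above can be illustrated with a minimal Python sketch. It assumes a toy in-memory corpus; the function names (build_index, boolean_and, boolean_or, boolean_not) and the whitespace tokenizer are hypothetical choices made for illustration and are not taken from any of the cited systems.

```python
from collections import defaultdict

PUNCT = '.,;:!?"()'  # assumed punctuation to strip from tokens


def build_index(docs):
    """Map each lowercased keyword to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for raw in text.lower().split():
            word = raw.strip(PUNCT)
            if word:
                index[word].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Conjunction: documents containing every term, in any order and position."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def boolean_or(index, *terms):
    """Disjunction: documents containing at least one of the terms."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


def boolean_not(index, all_ids, term):
    """Negation: documents that do not contain the term."""
    return set(all_ids) - index.get(term.lower(), set())


docs = {
    1: "information storage and retrieval",
    2: "sensory information",
    3: "storage devices",
}
index = build_index(docs)
print(boolean_and(index, "information", "storage"))
```

Because each document is reduced to a set of keywords, the conjunction cannot distinguish adjacent terms from terms that merely co-occur somewhere in the same document.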

Unconstrained boolean search represents a 
document as a set of keywords; sequence informa- 
tion is ignored. Proximity search paradigms modify 
this representation by placing non-boolean nearness 
constraints on otherwise standard boolean queries. A 
proximity operator is introduced that demands that 
two given search terms occur within some given dis- 
tance (expressed as a number of characters or a num- 
ber of words) in order for the basic conjunction to be 
satisfied. For example, the sample query above may 
be narrowed by requesting that the two search terms 
appear within one word of each other, in any order or, 
alternatively, in the given order. 
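
A proximity operator of this kind might be sketched as follows. This is an illustration under stated assumptions: distance is counted as the number of intervening words (the text also allows a distance in characters), and the names tokenize and within are hypothetical.

```python
PUNCT = '.,;:!?"()'  # assumed punctuation to strip from tokens


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (w.strip(PUNCT) for w in text.lower().split())
    return [w for w in stripped if w]


def within(tokens, a, b, max_gap=1, ordered=False):
    """True if a and b occur with at most max_gap words between them.

    With ordered=True, a must also precede b.
    """
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    for i in pos_a:
        for j in pos_b:
            if i == j:
                continue
            gap = abs(j - i) - 1  # number of intervening words
            if gap <= max_gap and (not ordered or i < j):
                return True
    return False


tokens = tokenize("retrieval of information")
print(within(tokens, "information", "retrieval", max_gap=1))
print(within(tokens, "information", "retrieval", max_gap=1, ordered=True))
```

The unordered form accepts both "information retrieval" and "retrieval of information", while the ordered form accepts only the first.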

Proximity search enables the user to form 
phrase-like queries; that is, a combination of terms is
treated as a search unit. This assumes considerable 
importance when one recalls that a query is a repre- 
sentation of an "information need". Often the con- 
cepts inherent in this information need are not ex- 
pressible as single words. Instead, phrases and even 
complete sentences must be employed to fully disam- 
biguate the thought. Conjunctive boolean queries al- 
low for the expression of these sorts of combination, 
but they also clearly over-generate. Higher precision 
is achievable by making use of nearness constraints. 
At the other extreme, complete specification of term 
order may also be detrimental since it is a property of 
most natural languages (including English), that 
phrasal units may be rewritten in multiple ways with- 
out a change in meaning. For example "dog's ankle" 
and "ankle of a dog" express the same concept. 
Hence, the application of proximity constraints must
be strong enough to filter out disconnected
occurrences, yet flexible enough to account for trivial
language variations.

Text retrieval may be thought of as iterative query
refinement. Each stage involves a query specification
followed by a resolution. If the results of the
query satisfy the information need of the user, the
process ends. Otherwise, a new, modified, and
presumably more appropriate query must be formulated,
and the process iterated. The results of the previous
steps inform the query reformulation at each stage.
Traditional applications of boolean or proximity
search provide little support for this reformulation
process. These traditional applications usually only
generate a candidate set of documents which satisfy
the search criterion. The user must then judge the
effectiveness of the query by perusing these documents,
a potentially time consuming operation, especially
if document titles are insufficient to disambiguate
relevant from non-relevant hits. In fact, there is
empirical evidence that boolean searches resolve
into two classes, those whose result sets contain only
a very few hits (narrow query), and those that result
in a great many hits (broad query). In the case of only
a few hits, the user is left with the uncomfortable
feeling that something may have been missed, which
leads to a desire to broaden the existing query.
Arriving at an appropriate broadening can be elusive since
no particular alternative is suggested by the search
results themselves. If the query is over broadened,
the user is presented with far too many hits, and the
task of separating out the relevant documents from
the mass becomes daunting.

The problem here is that the user is provided with
little or no assistance in query reformulation. H.P. Frei
et al, in "Caliban: Its user-interface and retrieval
algorithm", Technical Report 62, Institut für Informatik,
ETH, Zurich, April 1985, disclose a dictionary of
available search terms which can aid the search for
alternative terminology, as well as an online domain
specific thesaurus. However, often, the help of a
highly trained intermediary, such as a research
librarian, is required to derive a desirable reformulation.

One solution is to provide enough information
about each hit that the user can rapidly determine the
contextual usage, and hence arrive at a relevance
judgment and, possibly, a reformulation, without
necessarily scanning the entire text. Paper-based
keyword-in-context indices (sometimes known as
permuted indices) offer a solution for the case of single
term queries. The user enters the index with a single
term, the word of interest, and finds, arranged
alphabetically, single lines of context for each instance of
that term, with the search term columnated at the
center of each line. See H.P. Luhn, "Keyword-in-Context
for Technical Literature", ASDD Report RC-127,
IBM Corporation, Yorktown Heights, N.Y., August,
1959. The choice of an alphabetic sort key for the
lines of context can be less than optimal if one is
searching for related phrases containing the search
key, since the words determining the phrase will
typically be placed around the key, rather than heading
the line. An alternative sort key that captures this
intuition has been employed successfully in the
generation of an index to titles in Statistics and Probability.
See I.C. Ross and J.W. Tukey, "Index to Statistics
and Probability: Permuted Titles", Volumes 3 and 4
of the Information Access Series, R&D Press, Los
Altos, CA., 1975; also available from the American
Mathematical Society, Providence, R.I.
Computerized versions of these sorts of indices exist in a
variety of different forms, yet few, if any, elaborate on the
basic query and display strategy.

Accordingly, a need exists for a search tool which
will assist an operator in formulating a search query,
particularly when the operator has little information 
about the corpus of documents which the operator is 
searching. 

U.S. Patent No. 4,823,306 to Barbic et al
discloses a method for retrieving, from a library of
documents, documents that match the content of a
sequence of query words, and for assigning a relevance
factor to each retrieved document. The method
comprises the steps of: defining a set of equivalent words
for each query word and assigning to each equivalent
word a corresponding word equivalence value;
locating target sequences of words in a library document
that match the sequence of query words in
accordance with a set of matching criteria; evaluating a
similarity value for each of the target sequences of words
as a function of the corresponding equivalence
values of words included therein; and obtaining a
relevance factor for the library document based upon the
similarity values of its target sequences.

U.S. Patent No. 4,972,349 to Weinberger
discloses an interactive, iterative information retrieval and
analysis system wherein a "table of contents",
organized as a standard outline in some similarly graphic
format, is dynamically generated in response to
specific search requests. Documents satisfying the
search request are categorized based upon the
existence of predefined key words therein. The table of
contents is organized into key word categories,
sub-categories, sub-sub-categories, etc. The analysis
process can be repeated for a specific category or
sub-category of the table of contents to derive a new
table of contents which is more focused and limited.

It is an object of the present invention to provide
a search retrieval system and method which assists
an operator in search key (query) formulation.

It is another object of the present invention to
provide an information retrieval system and method
which guides an operator through a set of likely
relevant phrases as they occur in a target corpus to assist
the operator in query formulation.

It is another object of the present invention to
provide an information retrieval system and method
which exposes an operator to variations in phrasal
statements incorporating terms of interest, leading to
a judgment of which of these phrases best capture
the desired informational need.

It is a further object of the present invention to 
provide an information retrieval system and method 
which identifies text fragments occurring in a corpus 
of documents which are more specific than an input 
search key (query) and which are presented in a man- 
ner which assists the operator in formulating further 
search keys. 

To achieve the foregoing and other objects, and 
to overcome the shortcomings discussed above, an 
information retrieval system and method are provided 
in which an operator inputs one or more query words 
which are used to determine a search key for search- 
ing through a corpus of documents, and which returns 
any matches between the search key and the corpus 
of documents as a phrase containing the word data 
matching the search key, a non-stop (content) word 
next adjacent to the matching word data, and all in- 
tervening stop-words between the matching word 
data and the next adjacent non-stop-word. The
operator, after reviewing one or more of the returned
phrases, can then use one or more of the next
adjacent non-stop-words as new query words to
reformulate the search key and perform a subsequent search
through the document corpus. This process can be 
conducted iteratively, until the appropriate docu- 
ments of interest are located. 

In one embodiment, only one non-stop-word (lo- 
cated immediately adjacent to the search key) is re- 
turned along with the query word(s) in each phrase. 
The additional non-stop-words from each phrase are 
preferably aligned with each other (e.g., by columna- 
tion) to ease viewing of the "new" content words. The 
aligned additional non-stop-words can be displayed 
in a distinctive form (e.g., highlighted) so that the new 
aspect of the returned word data is emphasized, rath- 
er than the old. 

Separate phrases can be returned to display the 
non-stop-word on each side of each search key 
match. If an operator desires to view additional text 
(word data) associated with a returned phrase, an 
"extend" command is provided which causes the 
phrase to be extended from the displayed additional 
non-stop-word to the next adjacent non-stop-word. 
Alternatively, uninteresting phrases can be deleted 
from the display by providing a "forget" command. 
The length of each phrase can extend up to the entire 
length of a line on the display screen so that a max- 
imum amount of context is provided, while still focus- 
ing operator attention on the next adjacent non-stop- 
word with the columnation and highlighting display 
features. 

The search key is usually formed as a boolean 
conjunction between the query words with a
proximity constraint of no more than one intervening
non-stop-word (proximity constraint of one). Additional
non-stop-words are then returned on one or both 
sides of any search key matches. If multiple query 
words are input, and the match includes the query 
words separated by a non-stop-word, that non-stop- 
word is returned as the new (highlighted) content 
word. 

The present invention can also be used to search 
through a corpus of documents which are in a lan- 
guage different from the language in which the query 
words are input. 

The invention will be described in detail with ref- 
erence to the following drawings in which like refer- 
ence numerals refer to like elements and wherein: 
Figure 1 is a block diagram of hardware compo- 
nents useable to practice the present invention; 
Figure 2 is a high level flow diagram of the search 
process according to the present invention; 
Figure 3 is a view of a display screen on a retrieval 
system operating in accordance with the present 
invention; 

Figure 4 is a view of a query formulation panel of 
the Figure 3 display screen; 
Figure 5 is a view of the text phrase review panel 
of the Figure 3 display screen; 
Figure 6 is a view of the query formulation panel 
containing a reformulated query; 
Figure 7 is the text phrase review panel resulting 
from the Figure 6 query reformulation; and 
Figure 8 is a portion of a phrase review panel il- 
lustrating the "extend" operation of the present 
invention. 

A. Overview 

The availability of high interaction user interfaces 
on modern workstations should adjust previous
models of information retrieval systems. The present
invention, by using a high interaction user interface,
brings the user back into the loop by making
interaction between the user and partial search results an
explicit component of query resolution. The user is
employed as an active filtering and query reformula- 
tion agent, which is only plausible if one presumes 
rapid response to user intervention. The present in- 
vention is a form of guided boolean search with prox- 
imity, and an associated browsing tool, which exem- 
plifies these principles, and which provides the oper- 
ator with an amount of information in a format which 
is appropriate to assist in query formulation. 

The present invention makes use of an interac- 
tive user interface. The basic underlying assumption 
is that short queries, consisting of a few search 
terms, are by their very nature radically incomplete. 
Hence, query repair and elaboration through user
interaction and iteration are essential to achieve
adequate recall. This can be achieved through a high
interaction interface which rapidly delivers results to
the user in a way that can be quickly appreciated, and
by offering a search method whose operation is
intuitive and which offers information as to which next
step will be most effective in achieving a desired
result.

The present invention addresses these issues by
allowing the user to directly inspect the space of
phrases generated by a set of terms (query words) of
interest. The intention is to aid query reformulation by
exposing the user to the range of variation present in
the target corpus. For example, a search performed
by the present invention which is keyed by the single
term "information" might display phrases such as
"information storage and retrieval", "advances in
information retrieval", "sensory information", and
"genetic information" among others, each of which is
guaranteed to occur in the target corpus. This additional
information can be used by the operator to formulate
the next query.

From the user's perspective, the present
invention resembles a phrase search facility where the
search keys are treated as constituents (query
words), and completions are returned which contain
these and new constituents, organized in a fashion
that emphasizes the new rather than the old. Search
key formulation consists of specifying one or more
"constituents" (query words) in a way that requires
little or no query syntax. These constituents are then
matched against the corpus using a heuristic which
interprets them as a boolean conjunction with a
proximity constraint. Then, instead of returning matching
documents and treating the search as if it were
complete, as would a standard boolean search, the
present invention returns phrases including matches
embedded in a surrounding textual context. These
phrases are intended to contain sufficient context to
disambiguate usage, but not so much text as to
distract the reader or clutter the display.

The current heuristic returns the text surrounding
the search terms (the query words) plus one other
"significant" word, where significance is operationally
defined by not being on some prespecified list of
non-topic bearing words (a stop list). The neighboring
content word (the next non-stop-word adjacent to the
query words) provides disambiguating context and
can be highlighted in the display to draw the user's
attention to what is new, rather than what was input (the
query word(s)). Additionally, all stop-words located
between the displayed non-stop-words (i.e., the
query word(s) and the next significant word) are also
displayed. If the context is insufficient to
disambiguate usage, the user is encouraged to ask for more
(an operation called "extend"). If the context shows a
word combination which is a priori uninteresting, all
phrases with similar word structure can be deleted
(an operation called "forget"). The "forget" operation
is, in effect, boolean negation by example.

Since the time constants associated with each of
these operations can be made small, the overall
effect is to encourage incremental query reformulation
based on occurrences as they appear in the corpus
of interest. In the case where the disambiguating
context is sufficient to indicate that the returned phrase
is indeed relevant, the user may proceed directly to
the corresponding document.
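
The phrase heuristic outlined in this overview (a query word, the next adjacent non-stop-word, and every stop-word between them) can be sketched in Python for the single-query-word case. The stop list, the whitespace tokenization, and the names next_content and phrases_for are assumptions made for illustration only.

```python
STOP = {"a", "an", "and", "in", "of", "the", "to"}  # illustrative stop list


def next_content(tokens, start, step, stop=STOP):
    """Index of the nearest non-stop-word from start in direction step, or None."""
    i = start + step
    while 0 <= i < len(tokens):
        if tokens[i].lower() not in stop:
            return i
        i += step
    return None


def phrases_for(tokens, query, stop=STOP):
    """For each occurrence of query, return (phrase, new_word) for each side."""
    results = []
    for i, tok in enumerate(tokens):
        if tok.lower() != query.lower():
            continue
        for step in (-1, 1):  # look left, then right
            j = next_content(tokens, i, step, stop)
            if j is not None:
                lo, hi = min(i, j), max(i, j)
                # The slice keeps all intervening stop-words.
                results.append((" ".join(tokens[lo:hi + 1]), tokens[j]))
    return results


tokens = "advances in information retrieval".split()
for phrase, new_word in phrases_for(tokens, "information"):
    print(f"{phrase!r} (new: {new_word})")
```

Each match yields a separate phrase for each side of the query word, so the operator sees one new content word per returned line.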


B. Implementation 

The present invention can be implemented in an
information retrieval system as illustrated by block
diagram in Figure 1. The information retrieval system
includes a central processing unit (microprocessor)
10 for receiving signals from, and outputting signals
to, various other components of the system, according
to one or more programs run on microprocessor 10.
The system includes a read only memory (ROM) 14
for storing operating programs. A random access
memory (RAM) 18 is provided for running the various
operating programs, and additional files 22 could be
provided for overflow and the storage of indexed text
used by the present invention in performing a search
operation.

Prior to performing a search, a target text corpus
is input from a data base input 24, and is processed
by an indexing engine 28 which extracts context
words (ignoring words on a stop list) in each
document of the target corpus. Optionally, the indexing
engine can also normalize the context words, for
example, through the performance of stemming
operations. There are numerous prior art stemming
algorithms. For example, stemming can be performed by
using a dictionary-based exact inflectional
morphology analyzer (an algorithm which only strips endings
which do not change the part of speech, for example
"s" and "ed"). Alternatively, tail cropping procedures
which may produce non-words can be used. These
algorithms often also consider derivational morphology
(endings that change the part of speech, such as "ly"
and "tion") as well as inflectional morphology. One
might imagine additional normalizations, such as the
replacement of words by thesaurus classes, the
tagging of words with their part-of-speech, or the
annotation of words with their syntactic and semantic
roles.
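
An indexing pass of this general shape might look like the Python sketch below. It is not the indexing engine 28 itself: the stop list, the regular-expression tokenizer, and the crude tail-cropping rule in crop_tail (which can produce non-words, as noted above) are all assumptions chosen to keep the example short.

```python
import re
from collections import defaultdict

STOP = {"a", "an", "and", "in", "of", "the", "to"}  # illustrative stop list


def crop_tail(word):
    """Crude tail cropping of the inflectional endings 'ed' and 's'."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def index_corpus(docs):
    """Map each normalized non-stop-word to a list of (doc_id, word position)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"[a-z]+", text.lower())):
            if word not in STOP:
                index[crop_tail(word)].append((doc_id, pos))
    return index


docs = {"d1": "The dogs walked in the park", "d2": "A dog walks"}
index = index_corpus(docs)
print(index["dog"])
print(index["walk"])
```

Positions are recorded relative to the full token stream, stop-words included, so that the intervening stop-words can later be recovered for display.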

Monitor 36 is provided for displaying search
results, and for permitting the user to interface with the
operating programs. A user input device 32 such as,
for example, a mouse, a keyboard, a touch screen or
combinations thereof is provided for input of
commands by the operator. A printer 40 can also be
provided so that hard copies of documents can be
printed.

An on-line multi-language dictionary 44 can also
be provided for searching through a corpus of
documents in a language which is unfamiliar to the
operator.

Figure 2 is a high level flow diagram of the proc- 
esses performed by the present invention. In step 
110, the operator inputs one or more query words. 
These words can be input in a conventional manner, 
such as, for example, by typing the appropriate words 
in a display box using a keyboard. In step 120, the 
search key is formed. In the present invention, the 
search key is the boolean conjunction of all query
terms with a proximity constraint of one. A query term 
is a disjunctive set of query words input by the oper- 
ator. For example, if the operator input query words 
A and B, both A and B would be treated as query 
terms. Alternatively, if the operator input the words A 
and (B or C), A would be treated as a single query 
term and (B or C) would be treated as a single query 
term. The system can automatically treat each input 
query word as a query term so that the "and"
connector need not be input by the operator. For example,
"A B" would automatically be interpreted as "A and B".
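
The search key formation of step 120 can be sketched as a small parser that turns the operator's input into a conjunction of disjunctive query terms. The input syntax handled here (parentheses around "or" groups, an optional "and") and the name form_search_key are assumptions for illustration; the text does not prescribe a concrete query syntax.

```python
import re


def form_search_key(query):
    """Parse e.g. 'A and (B or C)' into [{'a'}, {'b', 'c'}].

    Each set is one query term (a disjunction of query words); the list as a
    whole is read as the conjunction of its terms.
    """
    terms = []
    for group, word in re.findall(r"\(([^)]*)\)|(\w+)", query.lower()):
        if group:
            words = {w for w in re.findall(r"\w+", group) if w != "or"}
            if words:
                terms.append(words)
        elif word and word != "and":
            terms.append({word})
    return terms


print(form_search_key("A and (B or C)"))
print(form_search_key("A B") == form_search_key("A and B"))
```

Representing every term as a set keeps single words and "or" groups uniform, so the matching step only ever tests set membership.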

In step 130, the search is performed. The search
will return all phrases in the corpus of documents
which include each query term and have a length
equal to the number of query terms plus one
additional non-stop-word (because of the proximity constraint
of one). This one additional non-stop-word provides
the operator with new information regarding the
usage of the search terms in that match. In step 140, the
returned phrases are displayed. The matches
between the corpus of documents and search key are
displayed as phrases which include the query
word(s), one additional next adjacent non-stop-word,
and all intervening text (stop-words, spaces, and
punctuation). Accordingly, the operator is provided
with one or more phrases which provide additional
information regarding the matches (and thus the
document associated with the matches). By selecting one
of the displayed next adjacent non-stop-words as a
new query word, and successively redetermining the
search key and performing the search, phrases are
returned having an increasing closeness to the
operator's informational need.
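
One way to realize the matching of step 130 is to slide a window of (number of query terms + 1) non-stop-words over the text and keep the windows that contain every query term plus exactly one other word. The Python sketch below takes that approach; it is an assumed reading of the proximity constraint of one, and the stop list and the name search are hypothetical.

```python
STOP = {"a", "an", "and", "in", "of", "the", "to"}  # illustrative stop list


def search(tokens, terms, stop=STOP):
    """Return (phrase, new_word) for each matching window of non-stop-words.

    terms is a list of sets of lowercase words (a conjunction of disjunctions).
    """
    content = [i for i, t in enumerate(tokens) if t.lower() not in stop]
    n = len(terms)
    hits = []
    for k in range(len(content) - n):
        window = content[k:k + n + 1]
        words = [tokens[i].lower() for i in window]
        # Window positions that match no query term: candidates for the new word.
        extra = [i for i in window
                 if not any(tokens[i].lower() in t for t in terms)]
        if len(extra) == 1 and all(any(w in t for w in words) for t in terms):
            # Slice over the original tokens so intervening stop-words are kept.
            phrase = " ".join(tokens[window[0]:window[-1] + 1])
            hits.append((phrase, tokens[extra[0]]))
    return hits


tokens = "advances in information storage and retrieval".split()
print(search(tokens, [{"information"}]))
print(search(tokens, [{"information"}, {"retrieval"}]))
```

With two query terms, the non-stop-word lying between them is reported as the new word, which corresponds to the case of query words separated by one intervening non-stop-word.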

In order to emphasize the new information pro- 
vided to the user, it is desirable to display multiple 
matches on the display simultaneously, with each 
match consuming at most a single line on the display, 
and with the respective next adjacent non-stop- 
words from each phrase being aligned with each 
other in a common column (this is referred to as gut- 
tering). Preferably, the additional non-stop-words are 
displayed in a distinctive form, such as, for example, 
highlighting, different from the display of the other 
word data in the displayed phrases to further empha- 
size the new information over the old information. It 
is also possible to display the query words in italicized 
form so that they can be distinguished. 
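
The guttering described above amounts to padding the text before each new word so that the new words share a column. A minimal Python sketch follows, in which square brackets stand in for highlighting; the name gutter and the (left, new_word, right) row format are assumptions.

```python
def gutter(rows):
    """Align the new word of each phrase in one common column.

    rows is a list of (left_text, new_word, right_text) tuples, one per match.
    """
    col = max((len(left) for left, _, _ in rows), default=0)
    return [f"{left:>{col}} [{new}] {right}".rstrip()
            for left, new, right in rows]


rows = [
    ("information", "retrieval", ""),
    ("", "sensory", "information"),
    ("information storage and", "retrieval", ""),
]
for line in gutter(rows):
    print(line)
```

Each match occupies a single line, and the operator can scan the bracketed column to pick the next query word.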

Although it is possible that the operator will be
able to select the appropriate document or
documents from the corpus after performing only a single
search, it is more likely that additional operations will
be necessary before the operator's informational
need is satisfied. Accordingly, in step 150, an
operator can extend a selected phrase in the display. As
illustrated by step 155, when a phrase is extended, the
next adjacent non-stop-word in the selected phrase
is removed from the gutter (i.e., the common column)
and the immediately next adjacent non-stop-word is
placed in the common column and highlighted. The
extend operation is performed, for example, when the
originally highlighted non-stop-word provides little or
no content information to the operator. Accordingly,
additional text in the phrase is provided by moving on
to the next non-stop-word. A phrase can be extended
a plurality of times until the phrase consumes the
entire length of a line of text on the display screen. It is
generally desirable to maintain the query words on
the display screen, and additionally it may be
desirable to limit the extension operation to within a single
sentence (since the context of a word can usually be
determined from the sentence in which it is located).
All phrases containing the same next non-stop-word
are extended.

In step 160, a "forget" operation is performed. As
illustrated in step 165, the forget operation results in
the deletion of a selected phrase from the display. All
phrases containing the same non-stop-word are also
deleted from the list of returned phrases.

As described above, an operator can also revise
the query words in step 170. This usually involves
adding one or more query words to the previously
searched list of query words. Operation then returns
to step 120 and the search and display operations are
repeated until the operator ends the search in step
180 and, for example, views (and possibly prints)
desired documents.

The information retrieval system according to the
present invention can be extended by providing an
on-line multi-language dictionary so that searches
can be performed through a corpus of documents
written in a language which is foreign to the operator.
The problem with searching foreign language documents
is twofold: first, formation of the query, and
second, understanding of the results. The latter is
particularly troublesome since translating a document
is a costly and time consuming task, even with
machine translation aids. The present invention assists
with both of these problems since the query length
is usually small (as small as one word), and the number
of additional context words returned for each
match can be as low as a single word.

For example, assuming the user is a speaker of
English, and wants to query a corpus of French documents,
the user provides a pair of English words A and
B, and specifies a corpus of documents in French. An
English-to-French and French-to-English dictionary
is also required. Searches proceed as follows: use
the dictionary to translate A to a set of corresponding
French words A1, A2, A3, ..., and to translate B to a set
of corresponding French words B1, B2, B3, .... Find all
phrases based on these pairs in any combination
(i.e., search (A1 or A2 or A3 or ...) and (B1 or B2 or B3
or ...), with a proximity constraint of one). While many
of these pairs do not really belong together, the corpus
of documents itself will correct the error, because
most of the pairs will simply not be found. For each
phrase returned by this search, display both the
French phrase, and the English language phrase
formed by A, B and the set (C1, C2, C3, ...) of possible
translations of the French context word, C, that was
found near the translated pair A and B.
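
The cross-language search just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionaries EN_TO_FR and FR_TO_EN, their entries, and the function name find_pairs are hypothetical, and the document is assumed to be already reduced to a list of content words.

```python
# Toy English-to-French and French-to-English dictionaries (hypothetical entries).
EN_TO_FR = {
    "movie": ["film", "cinema"],
    "industry": ["industrie", "secteur"],
}
FR_TO_EN = {"italienne": ["Italian"]}


def find_pairs(content_words, a, b, proximity=1):
    """Return (i, j, context) for each A_i / B_j pair within the proximity constraint.

    content_words is one document with stop-words already removed. context is
    the list of English translations of the first content word lying between
    the pair, or None when the pair is adjacent.
    """
    a_set = set(EN_TO_FR.get(a, []))   # A1 or A2 or A3 ...
    b_set = set(EN_TO_FR.get(b, []))   # B1 or B2 or B3 ...
    hits = []
    for i, word in enumerate(content_words):
        if word not in a_set:
            continue
        # At most `proximity` content words may separate the pair.
        lo = max(0, i - proximity - 1)
        hi = min(len(content_words), i + proximity + 2)
        for j in range(lo, hi):
            if j == i or content_words[j] not in b_set:
                continue
            between = content_words[min(i, j) + 1:max(i, j)]
            context = FR_TO_EN.get(between[0], []) if between else None
            hits.append((i, j, context))
    return hits
```

Most translated pairs never co-occur, so the inner test rejects them cheaply; only pairs that the corpus actually contains produce a hit and require translating a context word.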

This technique does not require much translation. 
Since the phrases are short, local translation techni- 
ques (even using some idiom dictionaries) should 
work. The translation which is required can be done 
on the fly. In particular, documents need not be trans- 
lated in entirety ahead of time. While precision may be 
somewhat low, recall should be high, and the interac- 
tive nature of the present invention should make it 
practical. Once a document is found that looks prom- 
ising, then more sophisticated and time consuming 
tools can be used to attempt translation of larger 
document units than words. No particular language 
need be chosen a priori as the base language. Speak- 
ers of different languages can use the document cor- 
pus simultaneously, so long as dictionaries are avail- 
able to and from their language. 

In addition, the user need not know which lan- 
guage is used in the searched document corpus as 
long as the search system can identify which bilingual 
dictionaries to employ. In particular, the searched 
corpus may be multilingual, written in more than one 
language. For example, the corpus may be recent 
French, German, and Japanese patents, and an ap- 
propriate number of bilingual dictionaries can be pro- 
vided for translating words. 

EXAMPLE 

A version of the above described search paradigm
has been implemented which reifies the strategy
outlined above. In particular, the present invention
is one of the search modes supported by the Text
Database architecture (TDB). See D.R. Cutting, J.
Pedersen, and P-K. Halvorsen, "An object-oriented
architecture for text retrieval", in Conference
Proceedings of RIAO '91, Intelligent Text and Image
Handling, Barcelona, Spain, pages 285-298, April 1991.
TDB is a software artifact implemented in Common
Lisp (G.L. Steele, Jr., "Common Lisp, the Language",
Digital Press, second edition, 1990) which is directed
towards fast prototyping of retrieval systems. A user
interface to TDB, known as the text Browser, uses the
Interlisp-D (Xerox Corporation, Interlisp-D Reference
Manual, Xerox AIS, 1987) window system to present
a multi-paradigm text search and retrieval tool 300
(see Figure 3). Currently, two search modes are supported
over the same corpus: similarity search and
the phrase oriented technique of the present invention.
The first two panels 310, 330 concern themselves
with the phrase oriented technique query
specification and the presentation of results, respectively,
for the present invention. The third panel 350
is for the scrollable display of documents. The last two
panels 360, 380 are concerned with similarity search
and are not part of the present invention. The ordering
is not particularly significant, although it is anticipated
that the phrase oriented technique of the present
invention will be most useful for fairly directed queries,
the results of which can then seed a browsing
method, such as similarity search.

An upper portion of panel 310 includes three
boxes, labeled "Query", "Abort", and "Sort" over
which a cursor can be positioned and actuated to input
commands (described below). A box, 315, is provided
into which the operator can enter (by typing)
query words. Additionally, a "Same Sentence" function
can be activated or deactivated by buttoning a
mouse cursor over the "Yes" or "No" boxes, respectively.
When activated, the "Same Sentence" function
limits returned phrases to occurrence within a single
sentence. A "Query Interaction" bar is provided and
includes the boxes "Forget", "Extend", "Step", and
"View". The "Forget" and "Extend" boxes cause
those operations to be performed on a selected
phrase by buttoning a mouse when the cursor is located
over the appropriate box. The "Step" box causes
the incremental movement of the phrase selector
332 when activated. When phrase selector 332 is located
at the bottom of display panel 330, activation of
the "Step" box causes scrolling of the displayed
phrases. The "View" box causes the document associated
with the selected phrase to be viewed in the
view screen 350.

Prior to search, the target text corpus (in this example,
Grolier's encyclopedia, 64 megabytes of ASCII
text) was processed by an indexing engine that extracted
the content words (ignoring words on a stop
list) in each document (in this example, an article in
the encyclopedia), normalized them through the removal
of inflectional morphology, and recorded their
sequential offsets in a b-tree based inverted index.
See, for example, D.R. Cutting and J.O. Pedersen,
"Optimizations for dynamic inverted index maintenance",
Proceedings of SIGIR '90, September 1990.
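
A minimal sketch of such an indexing pass is given below, in Python rather than the Common Lisp of TDB. It is an illustration only: the stop list, the crude normalize function (which merely strips punctuation and a trailing plural "s" in place of real morphological analysis), and the in-memory dictionary used instead of a b-tree are all assumptions, as is the choice to count offsets over content words only.

```python
from collections import defaultdict

STOP_WORDS = {"the", "of", "a", "in", "and", "was"}  # toy stop list (assumption)


def normalize(word):
    """Crude stand-in for removal of inflectional morphology."""
    w = word.lower().strip(".,;:!?\"'()")
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]  # "movies" -> "movie"
    return w


def build_index(documents):
    """Map each normalized content word to a sorted list of (doc_id, offset).

    documents maps a doc_id to its text. Offsets are 1-based and count
    content words only, so stop-words never widen a proximity gap.
    """
    index = defaultdict(list)
    for doc_id in sorted(documents):
        offset = 0
        for raw in documents[doc_id].split():
            word = normalize(raw)
            if not word or word in STOP_WORDS:
                continue
            offset += 1
            index[word].append((doc_id, offset))
    return index
```

Because documents are visited in sorted order and offsets only increase, every posting list comes out already ordered by document and then by offset, which the merge-based query algorithms rely on.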

Search then proceeds by specifying a set of
words which will form the components of a phrase
match criterion (see Figure 4). The query words are
typed into an area (i.e., box 315) on the display
screen. When satisfied with the query, the operator
buttons the "Query" box to start the search. The
"Abort" box can be buttoned at any time to cancel a
search. In this example the user is interested in
phrases that include the word "movie" (or its inflectional
variations). Note that the interface reports the
marginal frequency of the search term, and the number
of hits currently found. The query is resolved by
interpreting it as a boolean conjunction with a proximity
constraint (in this example, a proximity constraint
of one). A match occurs if all query terms occur
with no more than one content word gap between
them. In the example, since there is only one query
term, all instances of "movie" match.

The result of a query is a set of text phrases, each
satisfying the phrase match criterion (see Figure 5).
In the example each instance of "movie" generates up
to two overlapping phrases, one for the additional
context word on each side of the query word (for a total
of 263). Since the sentence limitation is activated,
some query word occurrences generate only one
phrase. These are presented in a stylized fashion to
aid perusal by the user. The display heuristic presents
the query terms plus one additional non-stop-word
and all the intervening (unindexed) text, containing
space, punctuation and stop-words. It is hoped
that the additional non-stop-word will provide
disambiguating context. The inclusion of the intervening
unindexed text provides useful syntactic information,
especially function words. Up to an entire line of text
can be returned for each phrase, with only the next
adjacent non-stop-word being highlighted. This provides
an operator with a maximum amount of information,
while still using the next adjacent non-stop-word
for alignment.
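
The display heuristic can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the stop list and the function names are hypothetical, the text is assumed to be pre-split into tokens, and the sentence limitation is not modeled.

```python
STOP_WORDS = {"the", "of", "a", "in", "and", "was", "by"}  # toy stop list


def is_stop(token):
    return token.lower().strip(".,;:!?") in STOP_WORDS


def phrases_for(tokens, i):
    """Return up to two (phrase_text, gutter_word) pairs for the query word tokens[i].

    Each phrase holds the query word, the nearest non-stop-word on one side,
    and every intervening stop-word.
    """
    results = []
    # Scan right for the next adjacent non-stop-word.
    j = i + 1
    while j < len(tokens) and is_stop(tokens[j]):
        j += 1
    if j < len(tokens):
        results.append((" ".join(tokens[i:j + 1]), tokens[j]))
    # Scan left for the next adjacent non-stop-word.
    k = i - 1
    while k >= 0 and is_stop(tokens[k]):
        k -= 1
    if k >= 0:
        results.append((" ".join(tokens[k:i + 1]), tokens[k]))
    return results
```

A query word at the very start or end of the text yields only one phrase, which mirrors the way a sentence boundary can suppress one of the two phrases in the example above.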

To focus the user's attention on new information,
the phrases are formatted so that the additional
non-stop-word is placed adjacent to an easily recognizable
location. This has the effect of columnating
these contexts next to a vertical strip of white space,
known as the "gutter". The gutter word is highlighted
with a bold font, and the query terms are distinguished,
but not emphasized, with an italic font. The final
display is reminiscent of a keyword-in-context index,
with the crucial difference that each gutter word is
new information (not just part of the match criterion),
and may be the result of a multi-term query.
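
As a rough plain-text illustration of guttering, the sketch below right-justifies whatever precedes the gutter word so that every gutter word starts in the same column. The layout details are assumptions: upper case stands in for the bold font, underscores stand in for italics, and the function name align_on_gutter and the (before, gutter_word, after) row format are hypothetical.

```python
def align_on_gutter(rows):
    """Format (before, gutter_word, after) rows so gutter words share a column.

    The text before the gutter word is right-justified against a one-space
    gutter; the gutter word is upper-cased as a stand-in for bold.
    """
    width = max((len(before) for before, _, _ in rows), default=0)
    lines = []
    for before, gutter_word, after in rows:
        line = f"{before:>{width}} {gutter_word.upper()} {after}"
        lines.append(line.rstrip())
    return lines


if __name__ == "__main__":
    demo = [
        ("_movie_", "industry", ""),
        ("", "silent", "_movie_"),
        ("_movie_", "theater", ""),
    ]
    print("\n".join(align_on_gutter(demo)))
```

Scanning down the common column then shows only the new words, while the query terms remain visible on either side for context.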

As with boolean search, no particular ordering of 
phrases is implied by the query resolution mecha- 
nism. In practice, it is convenient to organize phrases 
so that all phrases associated with a particular docu- 
ment appear in occurrence order. If documents are 
naturally stored in some order, perhaps alphabetically 
by title, which corresponds to a particular scan order 
through an inverted index, partial results may then be 
returned before the completion of the entire query. 
This is especially useful for queries with a large num- 
ber of hits, since the user may begin perusal of the 
partial results without waiting for search termination. 
Other presentation orderings may also be useful. In
particular, phrases may be sorted by the gutter word,
or by schemes that extract a sort key from the sequence
of content words. This could be accomplished
either incrementally or after search termination. In
the present example, buttoning the "Sort" box in panel
310 causes the phrases to be displayed in alphabetical
order by gutter word. When displayed in this
manner, the document title in which each phrase is
located is not displayed to the left of the phrases.

In this example, the user can easily see by inspection
that "movie" occurs in phrases such as "silent
movie", "movie theater", "movie industry", as well
as many others. To view more phrases without scrolling,
the user at this stage may choose to eliminate
phrases similar (in the sense of having the same gutter
word) to the one currently selected by buttoning
"forget" in the query panel. Alternatively, the user
may narrow the query by picking one of the completions
for further study. If the user re-evaluates the
query adding "industry" as an additional term (see
Figure 6), twelve hits are returned (see Figure 7).
Again, by inspection it is easy to see, for example,
that the article titled "Rome" has reference to the Italian
movie industry. The phrase "movie industry operated"
is not especially revealing; however, the user
may button "extend" to enlarge the viewed context
(see Figure 8). As illustrated by Figure 8, when "extend"
is selected, the previous gutter word ("operated")
is de-highlighted, and the next non-stop-word
("code") is aligned at the gutter. Any one of the phrases
may be selected, and the associated document
viewed (with the phrase highlighted) by buttoning
"view" in the query panel.

Once a document is viewed, the similarity search
operation can be performed. Similarity search is well
known and not a part of the present invention. It can
be implemented by prior art techniques. The similarity
search user interface panel provides four selection
boxes: "Selection", "Feedback", "Abort" and "View"
to the operator. "Selection" causes a similarity search
to be performed on the highlighted paragraph in the
view panel 350. "Feedback" causes a similarity
search to be performed on the entire document in the
display screen. "Abort" and "View" function as described
above.


ALGORITHMS 

Algorithms are now provided which permit the
present invention to be performed on a corpus of
documents stored in an inverted index. This permits
the extraction of possible searchable phrases from a
target corpus represented as strings of words.

It is presumed that each document, d, in a larger
corpus is a sequence of words,

\[ d = \{ w^d_1, \ldots, w^d_{n_d} \} \]

where $n_d$ is the number of word instances in document
d. It will be convenient in the following to consider
each word occurrence as a word interval of
length 1. That is, let

\[ (d, i, j) = \{ w^d_i, w^d_{i+1}, \ldots, w^d_j \}, \]

then

\[ d = \{ (d,1,1), (d,2,2), \ldots, (d,n_d,n_d) \}. \]

In the case of intervals of length one, let (d,s) =
(d,s,s).

An inverted map can be produced by preprocessing
each document. This map identifies each word
with the length one intervals that contain it,

\[ I(w) = \{ (d^w_1, s^w_{1,1}), \ldots, (d^w_1, s^w_{1,n^w_1}), (d^w_2, s^w_{2,1}), \ldots, (d^w_{n_w}, s^w_{n_w,n^w_{n_w}}) \}, \]

where $d^w_i$ is the i-th document containing an instance
of w, $s^w_{i,j}$ is the word offset of the j-th instance of w in
$d^w_i$, $n^w_i$ counts the number of instances of w in $d^w_i$, and
$n_w$ is the number of documents in which w occurs. If
there exists an ordering on documents, <, (we can always
construct such an ordering), then we will require
that I(w) is ordered as follows:

\[ d^w_i < d^w_j \iff i < j \]

and

\[ s^w_{i,j} < s^w_{i,k} \iff j < k. \]

In this setting, it is natural to define disjunction as a
merge operation on sequences of word intervals.
That is, the result of a disjunctive query
$q = \{ w^q_1, w^q_2, \ldots, w^q_{n_q} \}$ is defined to be:

\[ \bigoplus_{i=1}^{n_q} I(w^q_i) \]

where $\bigoplus$ denotes an n-ary merge operation on ordered
sequences, as can be implemented by a priority
queue in time proportional to
$(\log n_q) \sum_{i=1}^{n_q} |I(w^q_i)|$. See
D. Knuth, "The Art of Computer Programming", Vol. 3:
Sorting and Searching, Addison Wesley, 1973.
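
For illustration, the n-ary merge can be written with the priority-queue based heapq.merge from the Python standard library. The function name disjunction and the representation of each length one interval as a (doc_id, offset) tuple are assumptions of this sketch.

```python
import heapq


def disjunction(posting_lists):
    """Merge ordered posting lists of (doc_id, offset) tuples into one ordered list.

    heapq.merge keeps a heap with one entry per input list, so each output
    element costs time logarithmic in the number of query words.
    """
    return list(heapq.merge(*posting_lists))


if __name__ == "__main__":
    film = [(1, 2), (1, 7), (3, 1)]
    cinema = [(1, 5), (2, 4)]
    print(disjunction([film, cinema]))
```

Tuples compare by document first and offset second, which matches the ordering required of each inverted list.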

Similarly, conjunction with proximity can be seen
as a specialized merge operation. Suppose q is satisfied
by a sequence of words if every word $w^q_i$ occurs
at least once in the sequence, and the total length of
the sequence is no more than $|q| + p$, where $p \ge 0$ is
the proximity parameter. Let $I_i = I(w^q_i)$, and define $I^j_i$ to
be the j-th interval in $I_i$. Set $c_i = 1$ for all i, and let
$f_i = I^{c_i}_i$, initially the first interval in $I_i$. Let the $I_i$'s be ordered
by considering the $f_i$'s:

\[ I_i < I_j \iff f_i < f_j. \]

Let $(d_i, s_i)$ refer to $f_i$. Consider the following algorithm:

0 Result = $\emptyset$

1 Sort the $I_i$'s

2 If $d_i = d_j$ for all $1 \le i, j \le n_q$, and

\[ \sum_{i=1}^{n_q - 1} (s_{i+1} - s_i - 1) \le p \]

(that is, the interval from $s_1$ to $s_{n_q}$ has length at most $n_q + p$),
then append $(d_1, s_1, s_{n_q})$ to Result

3 Set $c_1 = c_1 + 1$

4 If $c_1 > |I(w^q_1)|$ return Result, else goto 1

As defined here, not every interval that satisfies
the query condition is necessarily returned; in cases
where two candidate intervals share left edges, only
the shorter will be selected. For example, suppose
the query pattern is "xy" and p = 1, then the sequence
"xyy" will generate only one result interval, although
two could be found. It is possible, with the addition of
backtracking, to modify this algorithm to be fully
correct.
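
The merge with proximity can be sketched in Python as follows. This is an illustrative reading of the steps above, not the TDB implementation: the function name, the zero-based cursors, and the (doc_id, offset) tuples are assumptions, and the match test is written as "interval length at most the number of query words plus p".

```python
def proximity_conjunction(posting_lists, p=1):
    """Return (doc_id, start, end) intervals containing every query word.

    posting_lists holds one ordered list of (doc_id, offset) per query word.
    An interval matches when all current postings lie in one document and
    span at most len(posting_lists) + p word positions.
    """
    n_q = len(posting_lists)
    if n_q == 0 or any(not pl for pl in posting_lists):
        return []
    cursors = [0] * n_q
    result = []                                   # step 0
    while True:
        # Step 1: order the lists by their current front posting.
        order = sorted(range(n_q), key=lambda i: posting_lists[i][cursors[i]])
        fronts = [posting_lists[i][cursors[i]] for i in order]
        same_doc = len({doc for doc, _ in fronts}) == 1
        start, end = fronts[0][1], fronts[-1][1]
        # Step 2: record the interval if it satisfies the query.
        if same_doc and end - start + 1 <= n_q + p:
            result.append((fronts[0][0], start, end))
        # Step 3: advance the list whose front posting is smallest.
        smallest = order[0]
        cursors[smallest] += 1
        # Step 4: stop once that list is exhausted.
        if cursors[smallest] >= len(posting_lists[smallest]):
            return result
```

Run on the "xyy" example (x at offset 1, y at offsets 2 and 3), the sketch returns the single interval (1, 1, 2): after the first match the only posting for x is consumed, so the longer interval ending at offset 3 is never examined.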

In the worst case, the inner loop of this algorithm
is executed $\sum_{i=1}^{n_q} |I(w^q_i)|$ times, while the cost of step
[1] is proportional to $n_q \log n_q$; hence the overall time
complexity of this algorithm is proportional to
$(n_q \log n_q) \sum_{i=1}^{n_q} |I(w^q_i)|$.

The above algorithms will return phrases of
length q or q + 1 when p = 1. Phrases of length q can
be extended to q + 1 by adding a word to the left or
the right. Thus, whenever a single query word is input,
and sentence boundaries are not considered, each
occurrence of the query word in the stored, indexed
text can return two phrases. The first phrase will include
the next adjacent non-stop-word on one side of
the query word (e.g., the right side) and the second
phrase will include the next adjacent non-stop-word
located on the other side of the query word (e.g., the
left side). If each phrase is displayed as an entire line
of text, there will be considerable overlap between
the two phrases returned for each match. Accordingly,
it may be desirable to display only one phrase for
each match, particularly when an entire line of text
is displayed.

When a plurality of query words are input, the re- 
turned phrase will include a number of non-stop- 
words equal to the number of query words plus one 
when p = 1. If all of the query words are adjacent to 
one another, two phrases could be returned for each 
match as described above. However, if there is a one 
word space within the match, the non-stop-word (the 
gutter word) will be located in that space, and conse- 
quently, only a single phrase will be returned for that 
match. 

The "extend" operation proceeds as follows. 
When an operator decides to extend a phrase, the 
gutter word of that phrase is added to the stop-list, 
and the returned list of phrases is modified (or reeval- 
uated). This is much faster than re-performing the 
search over the entire corpus of documents with an 
augmented stop-list. Accordingly, the reevaluated list 
of phrases can be quickly displayed to the user. Ad- 
ditionally, if the entire corpus of documents were re- 
searched with the gutter word added to the stop list, 
phrases which were not returned previously could be 
returned. 



When a new query term is added to the search,
all previous extensions are forgotten. However, it is
possible to add the previous non-stop-words from
extended phrases to the stop-list for all future searches
over the entire corpus of documents.

The "forget" operation functions similarly to the
"extend" operation in that only the returned phrases
are reevaluated. However, with the "forget" function,
the gutter word of the forgotten phrase is treated as a
boolean negation over the displayed phrases. It
should be noted, however, that re-searching the entire
corpus of documents excluding the forgotten word
would not return any new results (and would
take longer than reevaluation).

As defined herein, a search is performed over
the entire corpus of documents, and is a result of a
query formulation operation. In order to quickly provide
the operator with results of the "forget" and "extend"
functions, "forget" and "extend" are not treated
as searches. Instead, they are treated as reevaluations
which take place over the set of returned phrases
(not over the entire corpus).
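
The following Python sketch shows one way "forget" and "extend" could be carried out as reevaluations over the returned phrases only. The phrase representation (a dictionary holding a token list and the index of the gutter word), the function names, and the restriction to extension toward the right are all assumptions made for illustration.

```python
def gutter_word(phrase):
    return phrase["tokens"][phrase["gutter"]].lower()


def forget(phrases, selected):
    """Drop every returned phrase sharing the selected phrase's gutter word."""
    word = gutter_word(selected)
    return [p for p in phrases if gutter_word(p) != word]


def extend(phrases, selected, stop_words):
    """Treat the selected gutter word as a stop-word and advance matching phrases.

    Only the returned phrases are touched; the corpus is not searched again.
    """
    word = gutter_word(selected)
    stops = {s.lower() for s in stop_words} | {word}
    updated = []
    for p in phrases:
        if gutter_word(p) != word:
            updated.append(p)
            continue
        tokens, g = p["tokens"], p["gutter"] + 1
        while g < len(tokens) and tokens[g].lower() in stops:
            g += 1
        # Keep the old gutter position if no further non-stop-word exists.
        new_gutter = g if g < len(tokens) else p["gutter"]
        updated.append({"tokens": tokens, "gutter": new_gutter})
    return updated
```

Both functions return a new list and leave the input untouched, so a later query reformulation can discard the extensions simply by dropping the derived list.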

C. Possible Extensions 


The present invention can be extended in a variety
of ways. First, the current heuristic for choosing
the nearby disambiguating content word could be improved
by statistically evaluating the likely topic determining
value of a list of nearby candidate words.
This could be accomplished either by considering importance
weights (as defined by similarity search), or
by computing a dispersion measure based on a clustering
of the corpus.

If a stochastic part-of-speech tagger were available
it could be employed in at least two ways. Since
part-of-speech tagging can be sense disambiguating
(for example, "package" as a noun has quite a different
sense than "package" as a verb), the strategy
would be to segregate (or sort) returned phrases
based on the inferred part of speech of the query
terms. Another use would feed a tagged extended
context to a noun phrase recognizer in order to select
a syntactically coherent subset for display purposes.

The present invention is most useful in generating
candidate phrases given a single term query. In
this case, it may not be necessary to generate an exhaustive
listing. Instead, homomorphic phrases could
be represented as a single paradigm. This reduction
to equivalence classes would expose the variation
present in the corpus more readily than the listing of
repeated instances of the same (or similar) phrases.

Multi-term queries can be over-constraining.
Some form of automatic broadening may be appropriate
if only a few hits are found. This could be accomplished
by selectively weakening the match criterion
until, at the extreme, it becomes a disjunction, rather
than a conjunction. Such a strategy would differentially
weight returned phrases based on the degree of
match, and sort them accordingly.

While the present invention is described with ref- 
erence to a preferred embodiment, the particular em- 
bodiment is intended to be illustrative, not limiting. Va- 
rious modifications may be made without departing 
from the scope of the invention as defined in the ap- 
pended claims. 



Claims 

1. A method of selectively searching an automated
data base with data processing apparatus, said
data base containing a corpus of documents comprising
sequences of word data stored as stop-words
and non-stop-words in a memory, said
method comprising the steps of:

a) inputting (110) to said data processing apparatus
at least one query word;

b) determining (120) a word data search key
based upon said at least one query word;

c) searching (130) said document corpus to
identify all occurrences of a match between
said search key and said document corpus
word data;

d) displaying (140) each match as a phrase
containing the word data matching said
search key, a non-stop-word next adjacent to
said matching word data, and all intervening
stop-words between said matching word data
and said next adjacent non-stop-word; and

e) selecting (150, 160, 170) one of said next
adjacent non-stop-words as a new query
word and successively repeating steps b)-d)
using the selected new query word to locate
documents of interest from said document
corpus.

2. The method of claim 1, wherein said search key 
is said at least one query word. 

3. The method of claim 1, wherein a stemming op- 
eration is performed on said at least one query 
word to determine said search key. 

4. The method of claim 1, wherein the phrases for 
multiple matches are displayed simultaneously 
and so that the respective non-stop-words are 
aligned with each other in a common column. 

5. The method of claim 4, wherein said non-stop- 
words are displayed in a distinctive form different 
from the display of the other word data in the dis- 
played phrases. 

6. The method of claim 1, further comprising:

extending, responsive to an inputted command
from an operator, each displayed phrase to
include the word data in the word data sequence
containing the displayed phrase up to the
non-stop-word next adjacent to the last displayed
non-stop-word.

7. The method of claim 1, further comprising:

deleting a selected phrase from said displayed
phrases responsive to an inputted command
from an operator.

8. The method of claim 1, wherein said corpus of
documents are in a first language and said at
least one query word is in a second language, different
from said first language, and wherein said
search key determining step includes the steps
of:

i) translating said at least one query word from
said second language to a corresponding at
least one set of at least one corresponding
query word in said first language; and

ii) defining said search key so as to correspond
to a boolean disjunction between the
corresponding query words within each set of
corresponding query words.

9. The method of claim 1, wherein when a plurality
of query words are input by an operator, said word
data search key determining step includes:

determining said word data search key as
a boolean conjunction between said plurality of
input query words with a proximity constraint of
one.

10. The method of claim 1, wherein said displaying
step includes displaying one non-stop-word on
both sides of each match.

11. The method of claim 1, wherein when a plurality
of query words are input by an operator, said displaying
step includes:

displaying a non-stop-word on at least one
side of said match if said match includes said
query words located immediately adjacent to
each other; and

displaying one non-stop-word between
said query words if said match includes said
query words within one word of each other.

12. The method of claim 1, wherein said selection of
the next adjacent non-stop-word as a new query
word and said successive repetition of steps b)-d)
is performed responsive to a search command
denoted by a cursor selection of a displayed next
adjacent non-stop-word.

13. A document retrieval system storing a corpus of
documents comprising sequences of word data
stored as stop-words and non-stop-words in a
memory, including an apparatus for selectively
searching through the corpus of documents, the
apparatus comprising:

means (110) for receiving at least one
query word input by an operator of the document
retrieval system;

means (130) for searching through said
document corpus and identifying all occurrences
of a match between said document corpus word
data and a search key determined (120) based
upon said at least one query word; and

means (140) for displaying each match as
a phrase containing the word data matching said
search key, a single non-stop-word next adjacent
to said matching word data, and all intervening
stop-words between said matching word data and
said single non-stop-word.